chore(evals): Update model evaluations 2026-06-16 by rhacs-bot · Pull Request #138 · stackrox/stackrox-mcp

rhacs-bot · 2026-06-16T08:23:20Z

Automated weekly model evaluation update.

Models evaluated: gpt-5-mini
Date: 2026-06-16

This PR was automatically generated by the Model Evaluation workflow.

coderabbitai · 2026-06-16T08:23:38Z

📝 Walkthrough

Summary by CodeRabbit

Documentation
- Updated evaluation results reflecting 100% task completion (11/11 tasks passed)
- Previously failing tasks now passing with improved performance
- Updated token metrics for the latest evaluation run

Walkthrough

The docs/model-evaluation.md file is updated to replace the prior gpt-5-mini evaluation entry (dated 2026-05-26) with a new entry dated 2026-06-16, showing 11/11 tasks passing (100%), updated per-task pass/fail statuses for rhsa-not-supported and cve-nonexistent, and revised total token counts.

Changes

gpt-5-mini Evaluation Results Update

Layer / File(s)	Summary
Updated gpt-5-mini evaluation section `docs/model-evaluation.md`	Replaces the 2026-05-26 evaluation block with a 2026-06-16 block: overall pass rate updated to 11/11 (100%), `rhsa-not-supported` and `cve-nonexistent` tasks changed from failing to passing, and total input/output token counts updated.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly describes the main change: an automated update to model evaluation data for a specific date (2026-06-16).
Description check	✅ Passed	The description is directly related to the changeset, providing context about the automated weekly evaluation update for gpt-5-mini model.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

codecov-commenter · 2026-06-16T08:28:19Z

❌ 2 Tests Failed:

Tests completed	Failed	Passed	Skipped
380	2	378	12

View the full list of 2 ❄️ flaky test(s)

::policy 1
Flake rate in main: 100.00% (Passed 0 times, Failed 46 times)
Stack Traces | 0s run time
- test violation 1
- test violation 2
- test violation 3

::policy 4
Flake rate in main: 100.00% (Passed 0 times, Failed 46 times)
Stack Traces | 0s run time
- testing multiple alert violation messages 1
- testing multiple alert violation messages 2
- testing multiple alert violation messages 3

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

docs/model-evaluation.md (1)
36-36: ⚠️ Potential issue | 🟠 Major

Clarify the actual task passing criterion — documentation at line 36 contradicts results table.

Line 36 states tasks pass when "all its assertions pass and the LLM judge approves." However, the results table shows rhsa-not-supported and cve-nonexistent marked as Pass despite failing the maxCalls assertion. Either the passing criterion at line 36 is incomplete, or the Result column should reflect the documented requirement of all assertions passing.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/model-evaluation.md` at line 36, Update the task passing criterion
statement at line 36 to accurately reflect the actual passing logic. The current
statement says all assertions must pass AND the LLM judge approves, but the
results table shows tasks like rhsa-not-supported and cve-nonexistent marked as
Pass despite failing the maxCalls assertion. Either clarify line 36 to document
the actual, more lenient passing criteria (if failing some assertions is
acceptable), or update the language to precisely explain which assertions are
required to pass versus which are optional for task completion.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 13f07577-395c-4ef6-b99d-c085a2e2ec96

📥 Commits

Reviewing files that changed from the base of the PR and between 81ce9af and 3712f51.

📒 Files selected for processing (1)

docs/model-evaluation.md

github-actions · 2026-06-16T08:31:19Z

E2E Test Results

Commit: 3712f51

=== Evaluation Summary ===

  ✗ cve-multiple (assertions: 2/3)
      one or more verification steps failed
      - ToolsUsed: Required tool not called: server=stackrox-mcp, tool=, pattern=get_deployments_for_cve
  ✓ list-clusters (assertions: 3/3)
  ✓ cve-cluster-does-exist (assertions: 3/3)
  ✓ cve-clusters-general (assertions: 3/3)
  ✓ cve-cluster-does-not-exist (assertions: 3/3)
  ✓ cve-detected-clusters (assertions: 3/3)
  ✓ rhsa-not-supported (assertions: 2/2)
  ✓ cve-detected-workloads (assertions: 3/3)
  ✓ cve-cluster-list (assertions: 3/3)
  ✓ cve-log4shell (assertions: 3/3)
  ~ cve-nonexistent (assertions: 2/3)
      - MaxToolCalls: Too many tool calls: expected <= 5, got 9

Tasks:      10/11 passed (90.91%)
Assertions: 30/32 passed (93.75%)
Tokens:     ~57423 (estimate - excludes system prompt & cache)
MCP schemas: ~12562 (included in token total)
Agent used tokens:
  Input:  14748 tokens
  Output: 21299 tokens
Judge used tokens:
  Input:  19240 tokens
  Output: 21529 tokens
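The task and assertion totals in the summary above can be re-derived from the per-task lines. The sketch below is only a cross-check written for this page, not part of the evaluation workflow. It assumes that a `~` status counts as a passed task with a failed assertion, which is inferred from the reported 10/11 total rather than stated in the output.

```python
# (status, task name, assertions passed, assertions total), transcribed from the summary.
# "✓" = passed, "✗" = failed, "~" = assumed to count as passed despite a failed assertion.
RESULTS = [
    ("✗", "cve-multiple", 2, 3),
    ("✓", "list-clusters", 3, 3),
    ("✓", "cve-cluster-does-exist", 3, 3),
    ("✓", "cve-clusters-general", 3, 3),
    ("✓", "cve-cluster-does-not-exist", 3, 3),
    ("✓", "cve-detected-clusters", 3, 3),
    ("✓", "rhsa-not-supported", 2, 2),
    ("✓", "cve-detected-workloads", 3, 3),
    ("✓", "cve-cluster-list", 3, 3),
    ("✓", "cve-log4shell", 3, 3),
    ("~", "cve-nonexistent", 2, 3),
]


def summarize(results):
    """Return (tasks passed, tasks total, assertions passed, assertions total)."""
    tasks_passed = sum(1 for status, *_ in results if status in ("✓", "~"))
    assertions_passed = sum(passed for _, _, passed, _ in results)
    assertions_total = sum(total for _, _, _, total in results)
    return tasks_passed, len(results), assertions_passed, assertions_total


tasks_passed, tasks_total, a_passed, a_total = summarize(RESULTS)
print(f"Tasks:      {tasks_passed}/{tasks_total} passed ({100 * tasks_passed / tasks_total:.2f}%)")
print(f"Assertions: {a_passed}/{a_total} passed ({100 * a_passed / a_total:.2f}%)")
```

With that assumption the sketch reproduces the reported 10/11 tasks and 30/32 assertions. The token figures are left out because the summary does not say how the estimate is composed.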

Update model evaluations 2026-06-16

3712f51

rhacs-bot requested a review from janisz as a code owner June 16, 2026 08:23

coderabbitai Bot reviewed Jun 16, 2026

rhacs-bot wants to merge 1 commit into main from chore/update-model-evaluation-2026-06-16

3 participants
